Skip to content

[iris] Consolidate controller store layer around process-scoped stores#4836

Open
rjpower wants to merge 1 commit intomainfrom
claude/busy-beaver
Open

[iris] Consolidate controller store layer around process-scoped stores#4836
rjpower wants to merge 1 commit intomainfrom
claude/busy-beaver

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 16, 2026

Summary

Reshape the controller's persistence layer so every query/mutation flows through process-scoped store objects instead of ad-hoc inline SQL. Four consolidations land together because they share caches, call sites, and invariant boundaries.

§1 — Endpoint cache becomes a store

EndpointRegistry moves from its own module into store.py as EndpointStore, reusing the same on_commit write-through hook pattern as the other stores. endpoint_registry.py is deleted.

§2 — Stores are persistent and owned by ControllerDB

Stores are no longer reconstructed per-transaction. Each is built once in ControllerDB.__init__ and reused across all transactions, with caches living on the store instance. ControllerDB exposes tasks, jobs, workers, dispatch, meta, users, reservations, endpoints properties.

§3 — DbContext bundles cursor + stores

New DbContext / DbReadContext frozen dataclasses bundle a cursor with every store. Transition helpers that used to take a raw cursor now take ctx: DbContext, so callers can reach any store by name without plumbing. ControllerDB.transaction() is replaced by transact() / read_ctx().

§4 — TaskFilter reifies the "simple task read" query shape

28 inline SELECT FROM tasks sites across transitions.py collapse into self._db.tasks.query(ctx.cur, TaskFilter(...)) with chunked IN-list handling for the SQLite host-parameter cap and optional jobs / job_configs joins. EndpointQuery moves to store.py alongside it.

Net: +779 / -2338 LOC.

Test plan

  • uv run pytest lib/iris/tests/cluster/controller/test_task_store_query.py -x — 17 new tests covering TaskFilter validation, empty-list short-circuit, chunking + limit interaction, and the join variants
  • Full lib/iris test suite: 793 passing
  • ./infra/pre-commit.py --all-files --fix clean (ruff, black, pyrefly)

@rjpower rjpower added the agent-generated Created by automation/agent label Apr 16, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe79b17576

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +876 to +879
"UPDATE task_attempts SET state = ?, "
"finished_at_ms = COALESCE(finished_at_ms, ?), error = NULL "
"WHERE task_id = ? AND attempt_id = ?",
(int(r.attempt_state), r.now_ms, r.task_id, r.attempt_id),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve attempt error and exit code during requeue

This requeue path clears task_attempts.error and never records exit_code, so retriable FAILED/WORKER_FAILED updates lose the diagnostics for the finished attempt. Before this refactor, the transition path persisted attempt-level error/exit metadata even when the task was moved back to PENDING; now tasks_for_job_with_attempts and related dashboard/debug flows can no longer show why the previous attempt failed.

Useful? React with 👍 / 👎.

Comment on lines +809 to +810
self._endpoint_store._load_all()
self._worker_store.reset_attr_cache()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear job-config cache after replace_from restore

replace_from() reloads endpoint and worker caches after swapping SQLite files, but it does not invalidate the new process-scoped JobStore config cache. If any job config was cached before restore, later reads can keep using stale config rows from the old DB, which can skew post-restore transition logic (timeouts/resources/retry policy) until process restart.

Useful? React with 👍 / 👎.

rjpower added a commit that referenced this pull request Apr 16, 2026
Phase 1 + Phase 3 follow-up to the store consolidation (PR #4836).

Phase 1 — dead code and correctness:
- Remove ControllerDB.read_ctx()/DbReadContext (0 callers).
- Remove dead NotImplementedError branch in _filter_id_chunks for
  TaskFilter.task_ids + job_ids (no caller uses both).
- Delete WorkerStore._attr_cache. Post-denormalization worker_attributes
  has a compound PK and <100 healthy-active workers per tick — an
  indexed read is cheaper than the in-memory mirror, and the old cache
  was buggy anyway (prune_old_data skipped invalidation, mutations
  happened outside on_commit). Rewrite healthy_active_with_attributes
  to read attributes in the same read snapshot.
- Fix JobStore.get_config negative-cache pinning: don't cache misses.

Phase 3 — structural simplification:
- Delete MetaStore. The two last_submission_ms accessors become
  ControllerDB.get_counter/set_counter beside the existing
  next_sequence helper. The meta table itself stays (needed for
  priority_insertion sequencing and scheduler sort order).
- Delete PruneStore. Redistribute to the stores that own the tables:
  prune_task_resource_history → TaskStore;
  prune_per_worker_history(table=...) → WorkerStore.prune_task_history
  + WorkerStore.prune_resource_history (split by table, no string dispatch);
  get_inactive_worker_before → WorkerStore;
  get_finished_jobs_before → JobStore;
  batch_delete → module-level helper in db.py.
- Fold ReservationStore into DispatchStore (one method, moved
  replace_claims and converted the per-row INSERT loop to executemany
  while there).

Net: -170 lines. Tests: 792 passing in lib/iris/tests/cluster/controller.
@rjpower rjpower requested a review from ravwojdyla April 16, 2026 23:25
Restructure the controller storage layer into three distinct layers with
clear ownership boundaries:

- db.py: pure SQLite infrastructure (TransactionCursor, QuerySnapshot,
  read pool). No domain knowledge; no store ownership.
- schema.py: DDL (Table/Column) plus row dataclasses. Introduces a
  @Projection decorator and pcolumn() field helper that derive the
  column→dataclass mapping from the dataclass definition itself,
  eliminating ~160 lines of hand-written decode boilerplate and unifying
  the three TaskDetailRow query variants on one decoder.
- store.py: per-table domain stores (TaskStore / JobStore / WorkerStore
  / EndpointStore / DispatchStore / UserStore) plus ControllerStores,
  the process-scoped container. `ControllerStores.from_db(db)` is the
  single entry point; `transact()` and `read()` hand out a per-scope
  ControllerStore bundling a Cursor and the stores, and the caller can
  pass either `ctx` (an object with `.cur` and the stores) or `ctx.cur`
  into store methods. Read scope hands out a read-only connection, so
  any write raises OperationalError — locked in by a regression test.

The inversion (stores own db, not the other way around) removes the
prior circular setup where ControllerDB constructed and attached its
stores. Stores no longer carry `self._db`; every method takes an
explicit cursor. The `JobStore.attach_tasks` fixup for the TaskStore ↔
JobStore cycle is replaced by passing `tasks: TaskStore` into the two
methods that need it.

Deletes that fall out of the refactor:
- endpoint_registry.py: folded into EndpointStore
- constraints.py: unused
- MetaStore / PruneStore / ReservationStore + the worker attribute cache:
  all dead or better expressed as plain store methods
- test_reservation_holder_reset_regression.py: guarded a cache that no
  longer exists

Tier-1 cleanups landed alongside:
- sql_placeholders(n) helper (24 sites)
- WhereBuilder for building optional AND-joined predicates (3 sites)
- AttemptFinalizer.build(...) classmethod (7 of 8 sites)
- Test helpers (`*_for_test` methods) moved from ControllerTransitions
  into tests/cluster/controller/_testing.py where they belong.

ControllerTransitions, ControllerServiceImpl, ActorProxy, the dashboard,
and the benchmark script are all rewired to take ControllerStores
instead of ControllerDB. Tests updated accordingly; the remaining
`state._db` references are intentional (infra calls like
`db.snapshot()` / `db.wal_checkpoint()`).
@rjpower rjpower force-pushed the claude/busy-beaver branch from 417dbb9 to c1a79e1 Compare April 17, 2026 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant